True, if the response variable has correlation 0 with all the predictor variables, then the fitted model reduces to the intercept alone, with all slopes equal to 0. In the simple linear regression case, this would be a horizontal line going through the data at \(\bar{Y}\).
True, even if predictor variables are perfectly correlated, the model can still fit the data well. Thinking geometrically, \(\dim(X) < p\) in the case of multicollinearity, but the fitted values still live in the column space of \(X\); since that space exists, we can still fit the data, even though the individual coefficients are not uniquely determined.
True, because linear transformations of the variables have no effect on the ANOVA decomposition, and thus do not change the coefficient of multiple determination.
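A quick self-contained sketch of this invariance (simulated data, not from the original solution): rescaling and shifting a predictor leaves \(R^2\) unchanged, because the fitted column space is the same.

```r
# R^2 is unchanged under a linear transformation of a predictor
set.seed(1)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
r2_orig  <- summary(lm(y ~ x))$r.squared
r2_trans <- summary(lm(y ~ I(10 * x - 5)))$r.squared  # linear transform of x
all.equal(r2_orig, r2_trans)  # TRUE
```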
True, since the \(p-1\) individual t-tests are not equivalent to testing whether there is a regression relation between \(Y\) and the set of \(X\) variables (which is what the F test tests). Under multicollinearity, every t-test can be non-significant even when the F test is significant.
True, we can have a large number of predictor variables that are uncorrelated with each other, each of which is individually significant, while the overall test of the full set fails to reach significance.
library(ggplot2)
library(GGally)
library(plotly)
property <- read.table("property.txt")
colnames(property) <-
c("Ren.Rate", "Age", "Exp", "Vac.Rate", "Sq.Foot")
property
ggplotly(ggplot(data = property, aes(x = Age, y = Ren.Rate)) + geom_point())
We can see from the plot that there is no clear indication of a linear relationship between the age of a property and its rental rate.
We have the model equation: \[ Y_i = \beta_0 + \beta_1\tilde{X}_{i1} + \beta_2X_{i2} + \beta_4X_{i4} + \beta_{11}\tilde{X}_{i1}^2 + \varepsilon_i \] where \(\tilde{X}_{i1} = X_{i1} - \bar{X}_1\) is the centered age. NOTE: the uncentered \(X_{i1}\) could be fit in place of \(\tilde{X}_{i1}\).
property["AgeCent"] <- property$Age - mean(property$Age)
property["AgeSq"] <- property$AgeCent ^ 2
polyModel <-
lm(Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
summary(polyModel)
##
## Call:
## lm(formula = Ren.Rate ~ AgeCent + AgeSq + Exp + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89596 -0.62547 -0.08907 0.62793 2.68309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.019e+01 6.709e-01 15.188 < 2e-16 ***
## AgeCent -1.818e-01 2.551e-02 -7.125 5.10e-10 ***
## AgeSq 1.415e-02 5.821e-03 2.431 0.0174 *
## Exp 3.140e-01 5.880e-02 5.340 9.33e-07 ***
## Sq.Foot 8.046e-06 1.267e-06 6.351 1.42e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.097 on 76 degrees of freedom
## Multiple R-squared: 0.6131, Adjusted R-squared: 0.5927
## F-statistic: 30.1 on 4 and 76 DF, p-value: 5.203e-15
#Plotting Observations Against Fitted Values
ggplotly(
  ggplot() + aes(x = polyModel$fitted.values, y = property$Ren.Rate) +
    geom_point() +
    labs(x = "Fitted Values", y = "Observations", title = "Observations against Fitted Values")
)
We have the estimated regression function: \[ \hat{Y}_i = 10.19 - 0.182\tilde{X}_{i1} + 0.314X_{i2} + 0.000008X_{i4} + 0.014\tilde{X}_{i1}^2 \] We find that our model is a good fit: it has a reasonably high \(R^2_{adj}\), and the plot of observations against fitted values is fairly linear.
# Model 2
model2 <- lm(Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
summary(model2)
##
## Call:
## lm(formula = Ren.Rate ~ Age + Exp + Sq.Foot, data = property)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0620 -0.6437 -0.1013 0.5672 2.9583
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.237e+01 4.928e-01 25.100 < 2e-16 ***
## Age -1.442e-01 2.092e-02 -6.891 1.33e-09 ***
## Exp 2.672e-01 5.729e-02 4.663 1.29e-05 ***
## Sq.Foot 8.178e-06 1.305e-06 6.265 1.97e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.132 on 77 degrees of freedom
## Multiple R-squared: 0.583, Adjusted R-squared: 0.5667
## F-statistic: 35.88 on 3 and 77 DF, p-value: 1.295e-14
We find that both \(R^2\) and \(R^2_{adj}\) are higher for the quadratic model than for Model 2: \(R^2\) is \(0.6131\) for the quadratic model versus \(0.583\) for Model 2, and \(R^2_{adj}\) is \(0.5927\) versus \(0.5667\). This leads us to conclude that the quadratic model is a better fit than Model 2.
To test our full model versus our reduced model, we have: \[ H_0: \beta_j = 0\ \text{for all} \ j\in \mathbf J \qquad H_a: \beta_j \neq 0 \ \text{for some} \ j\in \mathbf J \] With test statistic and null distribution: \[ F^* = \frac{\frac{SSE(R)-SSE(F)}{df_R - df_F}}{\frac{SSE(F)}{df_F}}, \qquad F^* \sim F_{(df_R - df_F,\ df_F)} \]
We reject \(H_0\) if \(F^* > F_{(1- \alpha;\ df_R - df_F,\ df_F)}\).
# Find critical value: F(0.95; df_R - df_F, df_F), with df_R = 77, df_F = 76
qf(1 - 0.05, 77 - 76, 76)
anova(polyModel, model2)
Given that our value of \(F^* = 5.9078\) exceeds the critical value of approximately \(3.97\), we reject \(H_0\) and conclude that the quadratic term is significant in the model at \(\alpha = 0.05\).
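As a sanity check on the \(F^*\) formula, here is a self-contained sketch (simulated data, since it does not reuse the property objects) confirming that the hand computation matches what `anova()` reports for nested models:

```r
# General linear test: full vs. reduced model, computed by hand
set.seed(42)
n  <- 81
x1 <- rnorm(n); x2 <- rnorm(n)
y  <- 1 + 0.5 * x1 + 0.3 * x1^2 + rnorm(n)
full    <- lm(y ~ x1 + I(x1^2) + x2)   # full model
reduced <- lm(y ~ x1 + x2)             # reduced model drops the quadratic term
sseF <- sum(residuals(full)^2);    dfF <- full$df.residual
sseR <- sum(residuals(reduced)^2); dfR <- reduced$df.residual
Fstar <- ((sseR - sseF) / (dfR - dfF)) / (sseF / dfF)
all.equal(Fstar, anova(reduced, full)$F[2])  # TRUE
```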
# Our prediction for Model 2
predict(model2, data.frame(Age = 4, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
## fit lwr upr
## 1 15.11985 12.09134 18.14836
# Our prediction for quadratic model
predict(polyModel, data.frame(AgeCent = 4, AgeSq = 16, Exp = 10, Sq.Foot = 80000), interval = "prediction", level = 0.99)
## fit lwr upr
## 1 13.47259 10.47873 16.46645
We can see that when we predict with the quadratic model, we get a lower-valued interval than the prediction from Model 2. Note, however, that the two predictions are not for the same property: Model 2 uses \(Age = 4\), while the quadratic model was given \(AgeCent = 4\), which corresponds to an age of \(\overline{Age} + 4\) years, since \(AgeCent = Age - \overline{Age}\). For a direct comparison at age 4, we would set \(AgeCent = 4 - \overline{Age}\).
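Because AgeCent was defined as \(Age - \overline{Age}\), the correct input for a 4-year-old property is \(AgeCent = 4 - \overline{Age}\). A self-contained sketch (simulated data standing in for the property set, since it does not reuse the objects above) shows that this centered input reproduces the prediction of the equivalent uncentered quadratic fit at \(Age = 4\):

```r
# Predicting at age 4 from a model fit on centered age
set.seed(7)
d <- data.frame(Age = runif(81, 0, 16))
d$Rate <- 15 - 0.2 * d$Age + 0.01 * d$Age^2 + rnorm(81)
d$AgeCent <- d$Age - mean(d$Age)
cent   <- lm(Rate ~ AgeCent + I(AgeCent^2), data = d)  # centered quadratic fit
uncent <- lm(Rate ~ Age + I(Age^2), data = d)          # same model, uncentered
p1 <- predict(cent, data.frame(AgeCent = 4 - mean(d$Age)))  # age 4, centered scale
p2 <- predict(uncent, data.frame(Age = 4))                  # age 4 directly
all.equal(unname(p1), unname(p2))  # TRUE: identical predictions
```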